Authorship Attribution Based on Feature Set Subspacing Ensembles
Abstract
Authorship attribution can assist criminal investigations as well as cybercrime analysis. The task can be viewed as a single-label, multi-class text categorization problem. Given that the style of a text can be represented by plain word frequencies selected in a language-independent way, machine learning techniques that can handle high-dimensional feature spaces and sparse data can be applied to it directly. This paper focuses on classifier ensembles based on feature set subspacing. It is shown that an effective ensemble can be constructed using exhaustive disjoint subspacing, a simple method that produces many poor but diverse base classifiers. This simple model can be enhanced by a variation of the cross-validated committees technique applied to the feature set. Experiments on two benchmark text corpora demonstrate the effectiveness of the presented method, which improves on previously reported results, and compare it to support vector machines, an alternative machine learning approach well suited to authorship attribution.
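The idea of exhaustive disjoint subspacing can be illustrated with a minimal sketch: the feature set is split into disjoint subsets so that every feature is used exactly once, one weak classifier is trained per subset, and the predictions are combined. Everything beyond that idea is an assumption made for illustration and is not taken from the paper: the nearest-centroid base learner, the random partitioning, the majority-vote combination, and all function and class names are hypothetical.

```python
import random
from collections import Counter


def disjoint_subspaces(n_features, subspace_size, seed=0):
    """Randomly partition feature indices into disjoint subsets.

    Every feature index appears in exactly one subset (exhaustive, disjoint).
    """
    idx = list(range(n_features))
    random.Random(seed).shuffle(idx)
    return [idx[i:i + subspace_size] for i in range(0, n_features, subspace_size)]


class CentroidClassifier:
    """Hypothetical weak base learner: nearest class centroid on one feature subset."""

    def __init__(self, features):
        self.features = features
        self.centroids = {}

    def fit(self, X, y):
        counts = Counter(y)
        sums = {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(self.features))
            for k, f in enumerate(self.features):
                acc[k] += row[f]
        self.centroids = {c: [v / counts[c] for v in acc] for c, acc in sums.items()}
        return self

    def predict(self, row):
        def dist(c):
            return sum((row[f] - m) ** 2
                       for f, m in zip(self.features, self.centroids[c]))
        # sorted() makes tie-breaking deterministic (smallest label wins)
        return min(sorted(self.centroids), key=dist)


def fit_ensemble(X, y, subspace_size, seed=0):
    """Train one base classifier per disjoint feature subset."""
    subspaces = disjoint_subspaces(len(X[0]), subspace_size, seed)
    return [CentroidClassifier(s).fit(X, y) for s in subspaces]


def predict_ensemble(ensemble, row):
    """Combine base predictions by majority vote (an assumed combination rule)."""
    votes = Counter(clf.predict(row) for clf in ensemble)
    return max(sorted(votes), key=lambda c: votes[c])


# Toy word-frequency vectors for two hypothetical authors "a" and "b".
X = [[5, 4, 0, 1], [4, 5, 1, 0], [0, 1, 5, 4], [1, 0, 4, 5]]
y = ["a", "a", "b", "b"]
ensemble = fit_ensemble(X, y, subspace_size=2)
print(predict_ensemble(ensemble, [5, 5, 0, 0]))
```

Each base classifier sees only a small slice of the feature set, so it is individually weak, but because the slices do not overlap the classifiers tend to make different errors, which is the diversity the abstract refers to.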
Similar resources
Authorship Attribution in Bengali Language
We describe Authorship Attribution of Bengali literary text. Our contributions include a new corpus of 3,000 passages written by three Bengali authors, an end-to-end system for authorship classification based on character n-grams, feature selection for authorship attribution, feature ranking and analysis, and learning curve to assess the relationship between amount of training data and test accu...
Diversity in Random Subspacing Ensembles
Ensembles of learnt models constitute one of the main current directions in machine learning and data mining. Ensembles allow us to achieve higher accuracy, which is often not achievable with single models. It was shown experimentally and theoretically that in order for an ensemble to be effective, it should consist of classifiers having diversity in their predictions. One technique, which prov...
Explaining Delta, or: How do distance measures for authorship attribution work?
Authorship Attribution is a research area in quantitative text analysis concerned with attributing texts of unknown or disputed authorship to their actual author based on quantitatively measured linguistic evidence (see Juola 2006; Stamatatos 2009; Koppel et al. 2009). Authorship attribution has applications in literary studies, history, forensics and many other fields, e.g. corpus stylistics (...
An Off-the-shelf Approach to Authorship Attribution
Authorship detection is a challenging task due to many design choices the user has to decide on. The performance highly depends on the right set of features, the amount of data, in-sample vs. out-of-sample settings, and profile- vs. instance-based approaches. So far, the variety of combinations renders off-the-shelf methods for authorship detection inappropriate. We propose a novel and generally ...
Vote/Veto Meta-Classifier for Authorship Identification - Notebook for PAN at CLEF 2011
For the PAN 2011 authorship identification challenge we have developed a system based on a meta-classifier which selectively uses the results of multiple base classifiers. In addition we also performed feature engineering based on the given domain of e-mails. We present our system as well as results on the evaluation dataset. Our system performed second and third best in the authorship attribut...
Journal title:
- International Journal on Artificial Intelligence Tools
Volume 15
Publication date: 2006